Quantitative Methodology (UPF)
Load packages.
“Data Scientists spend up to 80% of the time on data cleaning and 20% on actual data analysis”.
Source: R for Data Science
A df is tidy if it fulfills these requirements (Wickham 2014):
But …
Votes to parties in the 2019 Spanish elections
# A tibble: 15 × 9
provincia municipio votos…¹ pp cs podem…² mas_p…³ eb pacma
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Almería Abla 710 193 47 30 0 2 3
2 Almería Abrucena 699 111 45 42 0 0 2
3 Almería Adra 10941 3126 574 645 0 16 81
4 Almería Albánchez 272 56 41 18 0 0 2
5 Almería Alboloduy 419 155 17 10 0 0 2
6 Almería Albox 4335 1261 378 247 0 2 28
7 Almería Alcolea 509 92 39 43 0 1 11
8 Almería Alcóntar 371 125 20 15 0 2 3
9 Almería Alcudia de Monteag… 106 47 5 6 0 0 0
10 Almería Alhabia 409 95 35 23 0 0 1
11 Almería Alhama de Almería 1984 402 103 182 0 0 18
12 Almería Alicún 150 38 11 7 0 0 3
13 Almería Almería 92383 24495 7848 9699 0 204 964
14 Almería Almócita 103 10 6 16 0 1 0
15 Almería Alsodux 81 16 1 9 0 3 3
# … with abbreviated variable names ¹votos_validos, ²podemos_iu, ³mas_pais
But …
Minor contracts granted in Barcelona
# A tibble: 15 × 5
ens contracte data proveidor import
<chr> <chr> <date> <chr> <dbl>
1 CONSORCIS Serveis NA Marc Redorta 1646.
2 CONSORCIS Serveis NA JESUS SANZ LOPEZ 360
3 CONSORCIS Serveis NA JOAN BOSCH MUNT… 690.
4 CONSORCIS Serveis NA ALLIANZ, Compañ… 456.
5 FUNDACIONS I ASSOCIACIONS <NA> NA Vopi4Elecnor 1891.
6 FUNDACIONS I ASSOCIACIONS Serveis 2918-11-19 Jaume Badosa 1754.
7 ORGANISMES AUTÒNOMS LOCALS Serveis NA WHADS MEDIA STU… 1392.
8 ORGANISMES AUTÒNOMS LOCALS Serveis NA AJUNTAMENT BARC… 4655.
9 ORGANISMES AUTÒNOMS LOCALS Subministrament NA INADHOC HABITAT 9859.
10 ORGANISMES AUTÒNOMS LOCALS Serveis NA DOT CONSULTING 18090.
11 ORGANISMES AUTÒNOMS LOCALS Subministrament NA SAYTEL 15562.
12 ORGANISMES AUTÒNOMS LOCALS Subministrament NA ID GRUP 1197.
13 ORGANISMES AUTÒNOMS LOCALS Subministrament NA ID GRUP 2117.
14 ORGANISMES AUTÒNOMS LOCALS Serveis NA CODINA, ESTHER 1600
15 ORGANISMES AUTÒNOMS LOCALS Serveis NA ABELEIRA, ANTON… 1200
But …
Price of rents
# A tibble: 15 × 3
nom_barri preu preu_m2
<chr> <dbl> <dbl>
1 el Raval 590. 10.8
2 el Barri Gòtic 713. 10.6
3 la Barceloneta 541. 14.4
4 Sant Pere, Santa Caterina i la Ribera 673. 11.0
5 el Fort Pienc 736. 10.4
6 la Sagrada Família 673. 10.6
7 la Dreta de l'Eixample 921. 9.84
8 l'Antiga Esquerra de l'Eixample 828. 10.4
9 la Nova Esquerra de l'Eixample 716. 10.3
10 Sant Antoni 693. 9.77
11 el Poble Sec 568 10.2
12 la Marina del Prat Vermell NA NA
13 la Marina de Port 554. 8.34
14 la Font de la Guatlla 632. 10.8
15 Hostafrancs 581. 10.4
Income per capita
# A tibble: 15 × 3
nom_barri sc import_euros
<chr> <dbl> <dbl>
1 el Raval 1 27503
2 el Raval 2 21913
3 el Raval 3 24220
4 el Raval 4 27405
5 el Raval 5 23014
6 el Raval 6 25581
7 el Raval 7 25959
8 el Raval 8 21900
9 el Raval 9 24849
10 el Raval 10 23306
11 el Raval 11 21571
12 el Raval 12 25592
13 el Raval 13 27908
14 el Raval 14 26078
15 el Raval 15 26092
Data is untidy because is:
Number of TB cases documented by the WHO in Afghanistan, Brazil, and China between 1999 and 2000 (cases & population).
Table A
# A tibble: 6 × 3
country year rate
* <chr> <int> <chr>
1 Afghanistan 1999 745/19987071
2 Afghanistan 2000 2666/20595360
3 Brazil 1999 37737/172006362
4 Brazil 2000 80488/174504898
5 China 1999 212258/1272915272
6 China 2000 213766/1280428583
Table B
# A tibble: 6 × 4
country year cases population
<chr> <int> <int> <int>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
Table C
# A tibble: 12 × 4
country year type count
<chr> <int> <chr> <int>
1 Afghanistan 1999 cases 745
2 Afghanistan 1999 population 19987071
3 Afghanistan 2000 cases 2666
4 Afghanistan 2000 population 20595360
5 Brazil 1999 cases 37737
6 Brazil 1999 population 172006362
7 Brazil 2000 cases 80488
8 Brazil 2000 population 174504898
9 China 1999 cases 212258
10 China 1999 population 1272915272
11 China 2000 cases 213766
12 China 2000 population 1280428583
Table D
# A tibble: 3 × 3
country `1999` `2000`
* <chr> <int> <int>
1 Afghanistan 745 2666
2 Brazil 37737 80488
3 China 212258 213766
NOO ANOTHER
Correlates of War Intra-State War Data
# A tibble: 442 × 4
WarName WarType SideA SideB
<chr> <dbl> <chr> <chr>
1 First Caucasus 5 Russia Georgians, Dhagestania, Chechens
2 Sidon-Damascus 6 Sidon Damascus & Aleppo
3 First Two Sicilies 4 Austria -8
4 First Two Sicilies 4 Two Sicilies Liberals
5 Spanish Royalists 4 Spain Royalists
6 Sardinian Revolt 4 Austria -8
7 Sardinian Revolt 4 Sardinia Carbonari
8 Greek Independence 5 Ottoman Empire Greeks
9 Greek Independence 5 -8 United Kingdom
10 Greek Independence 5 -8 France
# … with 432 more rows
PEW Research Religious Landscape Study1
# A tibble: 18 × 6
religion `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Agnostic 27 34 60 81 76
2 Atheist 12 27 37 52 35
3 Buddhist 27 21 30 34 33
4 Catholic 418 617 732 670 638
5 Don’t know/refused 15 14 15 11 10
6 Evangelical Prot 575 869 1064 982 881
7 Hindu 1 9 7 9 11
8 Historically Black Prot 228 244 236 238 197
9 Jehovah's Witness 20 27 24 24 21
10 Jewish 19 19 25 25 30
11 Mainline Prot 289 495 619 655 651
12 Mormon 29 40 48 51 56
13 Muslim 6 7 9 10 9
14 Orthodox 13 17 23 32 32
15 Other Christian 9 7 11 13 13
16 Other Faiths 20 33 40 46 49
17 Other World Religions 5 2 3 4 2
18 Unaffiliated 217 299 374 365 341
We change the rows and columns of the dataframe keeping the same information.
pivot_longer(df, cols, names_to, names_to)
pivot_wider(df, names_from, values_from)
separate(df, col, into, sep)
unite(df, col, ..., sep)
NA (Not Available)
Related with the state capacity to collect them (Stone2008?)
Check the World Bank
Small or underdeveloped states
is.na()
Quantitative Methodology (UPF)